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3 1. FIELD OF THE INVENTION 



5 The present Invention is related to the field of methods and apparatus for 

6 maintaining a consistent file system and for creating read-only copies of the file 

7 system. 

8 

9 2. BACKGROUND ART 



Hi All file systems must maintain consistency In spite of system failure. A 

Z w 

Llii2 number of different consistency techniques have been used in the prior art for this 



Qi3 purpose. 

14 



15 One of the most difficult and time consuming issues in managing any file 

16 server is making backups of file data. Traditional solutions have been to copy the 

17 data to tape or other off-line media. With some file systems, the file server must 

18 be taken off-line during the backup process in order to ensure that the backup is 

19 completely consistent. A recent advance in backup is the ability to quickly "clone" 

20 (i.e., a prior art method for creating a read-only copy of the file system on disk) a 

21 file system, and perform a backup from the clone instead of from the active file 
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1 system. With this type of file system, It allows the file server to remain on-line 

2 during the backup. 



5 '^p 



File System Consistency 



6 A prior art file system is disclosed by Chutani, et al. in an article entitled 

7 The Episode File System, USENIX, Winter 1992, at pages 43-59. The article de- 

8 scribes the Episode file system which is a file system using meta-data (i.e., inode 

9 tables, directories, bitmaps, and indirect blocks). It can be used as a stand-alone 

10 or as a distributed file system. Episode supports a plurality of separate file sys- 



Pii tem hierarchies. Episode refers to the plurality of file systems collectively as an 



«^5i2 "aggregate". In particular, Episode provides a clone of each file system for slowly 

•r »\ 

13 changing data. 



15 In Episode, each logical file system contains an "anode" table. An anode 

16 table is the equivalent of an inode table used in file systems such as the Berkeley 

17 Fast File System. It is a 252-byte structure. Anodes are used to store all user 

18 data as well as meta-data in the Episode file system. An anode describes the 

19 root directory of a file system including auxiliary files and directories. Each such 

20 file system in Episode is referred to as a "fileset". All data within a fileset is locat- 

21 able by iterating through the anode table and processing each file in turn. Epi- 

22 sode creates a read-only copy of a file system, herein referred to as a "clone", 
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1 and shares data with the active file system using Copy-On-Write (COW) tech- 

2 nlques. 



4 Episode uses a logging technique to recover a file system(s) after a system 

5 crashes- Logging ensures that the file system meta-data are consistent. A 

6 bitmap table contains information about whether each block in the file system is 

7 allocated or not. Also, the bitmap table indicates whether or not each block is 

8 logged. All meta-data updates are recorded in a log "container" that stores trans- 

9 action log of the aggregate. The log is processed as a circular buffer of disk 



fto blocks. The transaction logging of Episode uses logging techniques originally 



S 5 » 



5 ;! 



it : 



11 developed for databases to ensure file system consistency. This technique uses 

12 carefully order writes and a recovery program that are supplemented by database 

13 techniques In the recovery program. 

14 

15 Other prior art systems including JFS of IBM and VxFS of Veritas Corpora- 
ls tion use various forms of transaction logging to speed the recover process, but 
17 still require a recovery process. 

18 

19 Another prior art method Is called the "ordered write" technique. It writes 

20 all disk blocks in a carefully determined order so that damage is minimized when 

21 a system failure occurs while performing a series of related writes. The prior art 

22 attempts to ensure that inconsistencies that occur are harmless. 
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1 For instance, a few unused blocks or inodes being marked as allocated. The 

2 primary disadvantage of this technique is that the restrictions it places on disk or- 

3 der make It hard to achieve high performance. 

4 

5 Yet another prior art system is an elaboration of the second prior art 

6 method referred to as an "ordered write with recovery" technique. In this method, 

7 inconsistencies can be potentially harmful. However, the order of writes is re- 
ft stricted so that inconsistencies can be found and fixed by a recovery program. 

13 9 Examples of this method include the original UNIX file system and Bericeley Fast 

#0 File System (FFS). This technique does not reduce disk ordering sufficiently to 
eliminate the performance penalty of disk ordering. Another disadvantage is that 

■ *\ . 

illi2 the recovery process is time consuming. It typically is proportional to the size of 

2s ■ 

013 the file system. Therefore, for example, recovering a 5 GB FFS file system re- 

n s 

Cli4 quires an hour or more to perform. 

V If 

16 File Svstem Clones 

17 

18 Figure 1 is a prior art diagram for the Episode file system illustrating the 

19 use of Copy-On-Write (COW) techniques for creating a fileset clone. Anode 110 

20 comprises a first pointer 11 OA having a COW bit that is set. Pointer 11 OA refer- 

21 ences data block 114 directly. Anode 110 comprises a second pointer 11 OB 

22 having a COW bit that is cleared. Pointer 11 OB of anode references indirect 

23 block 112. Indirect block 112 comprises a pointer 112A that references data 
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1 block 124 directly. The COW bit of pointer 112A is set. Indirect block 112 com- 

2 prises a second pointer 112B that references data block 126. The COW bit of 

3 pointer 1 12B Is cleared. 



f ft*. 



5 A clone anode 120 comprises a first pointer 120A that references data 

6 block 114. The COW bit of pointer 120A is cleared. The second pointer 120B of 

7 clone anode 120 references Indirect block 122. The COW bit of pointer 1208 is 

8 cleared. I n turn, indirect block 122 comprises a pointer 122A that references data 

9 block 124. The COW bit of pointer 122A is cleared. 



11 As illustrated In Figure 1, every direct pointer 110A, 112A-112B, 120A, and 



i|1i2 122A and indirect pointer 11 OB and 120B In the Episode file system contains a 
ai3 cow bit. Blocks that have not been modified since the clone was created are 



14 contained in both the active file system and the clone, and have set (1) COW bits. 

15 The COW bit is cleared (0} when a block that is referenced to by the pointer has 

16 been modified and, therefore, is part of the active file system but not the clone. 

17 

18 When a clone is created in Episode, the entire anode table is copied, along 

19 with all indirect blocks that the anodes reference. The new copy describes the 

20 clone, and the original copy continues to describe the active file system. In the 

21 original copy, the COW bits in all pointers are set to indicate that they point to the 

22 same data blocks as the clone. Thus, when inode 110 in Figure 1 was cloned, it 

23 was copied to clone anode 120, and indirect block 112 was copied to clone indi- 
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1 rect block 122. In addition, COW bit 12A was set to indicate that indirect bloci<s 

2 112 and 122 both point to data block 124. In Figure 1, data block 124 has not 

3 been modified since the clone was created, so it is still referenced by pointers 

4 1 12A and 1 12B. and the COW bit in 1 12A is still set. Data block 126 is not part 

5 of the clone, and so pointer 112B which references it does not have its COW bit 

6 set. 



8 When an Episode clone is created, every anode and every indirect block in 

9 the file system must be copied, which consumes many mega-bytes and takes a 

10 significant amount of time to write to disk. 



{iti2 A fileset "clone" is a read-only copy of an active fileset wherein the active 



13 fileset is readable and writable. Clones are implemented using COW techniques, 

14 and share data blocks with an active fileset on a block-by-block basis. Episode 

15 implements cloning by copying each anode stored in a fileset. When initially 

16 cloned, both the writable anode of the active fileset and the cloned anode both 

17 point to the same data block(s). However, the disk addresses for direct and indi- 

18 rect blocks in the original anode are tagged as COW. Thus, an update to the 

19 writable fileset does not affect the clone. When a COW block is modified, a new 

20 block is allocated in the file system and updated with the modification. The COW 

21 flag in the pointer to this new block is cleared. 

22 
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The prior art Episode system creates clones that duplicate the entire inode 
file and all of the indirect blocks in the file system. Episode duplicates all inodes 
and indirect blocks so that it can set a Copy-On-Write (COW) bit in all pointers to 
blocks that are used by both the active file system and the clone. In Episode, it is 
important to identify these blocks so that new data written to the active file system 
does not overwrite "old" data that is part of the clone and, therefore, must not 
change. 

Creating a clone in the prior art can use up as much as 32 MB on a 1 GB 
disk. The prior art uses 256 MB of disk space on a 1 GB disk (for 4 KB blocks) to 
keep eight clones of the file system. Thus, the prior art cannot use large numbers 
of clones to prevent loss of data. Instead it used to facilitate backup of the file 
system onto an auxiliary storage means other than the disk drive, such as a tape 
backup device. Clones are used to backup a file system in a consistent state at 
the instant the clone is made. By cloning the file system, the clone can be 
backed up to the auxiliary storage means without shutting down the active file 
system, and thereby preventing users firom using the file system. Thus, clones 
allow users to continue accessing an active file system while the file system, in a 
consistent state is backed up. Then the done is deleted once the backup is com- 
pleted. Episode is not capable of supporting multiple clones since each pointer 
has only one COW bit. A single COW bit 30 is not able to distinguish more than 
one clone. For more than one clone, there is no second COW bit that can be set. 
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A disadvantage of the prior art system for creating file system clones is that 
it involves duplicating all of the inodes and all of the indirect blocks in the file 
system. For a system with many small files, tho inodes alone can consume a 
significant percentage of the total disk space in a file system. For example, a 1 
GB file system that is filled with 4 KB files has 32 MB of inodes. Thus, creating 
an Episode clone consumes a significant amount of disk space, and generates 
large amounts (I.e., many megabytes) of disk traffic. As a result of these condi- 
tions, creating a clone of a file system takes a significant amount of time to com- 
plete. 

Another disadvantage of the prior art system is that it makes it difficult to 
create multiple clones of the same file system. The result of this Is that clones 
tend to be used, one at a time, for short temn operations such as backing up the 
file system to tape, and are then deleted. 
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SUMMARY OF THE INVENTiON 

The present invention provides a method for maintaining a file system in a 
consistent state and for creating read-only copies of a file system. Changes to 
the file system are tightly controlled to maintain the file system in a consistent 
state. The file system progresses from one self-consistent state to another self- 
consistent state. The set of self-consistent blocks on disk that is rooted by the 
root inode Is refenred to as a consistency point (CP). To implement consistency 
points, WAFL always writes new data to unallocated blocks on disk. It never 
overwrites existing data. A new consistency point occurs when the fsinfo block is 
updated by writing a new root inode for the inode file into it. Thus, as long as the 
root inode is not updated, the state of the file system represented on disk does 
not change. 

The present invention also creates snapshots, which are virtual read-only 
copies of the file system. A snapshot uses no disk space when it is initially cre- 
ated. It is designed so that many different snapshots can be created for the same 
file system. Unlike prior art file systems that create a clone by duplicating the en- 
tire inode file and ail of the indirect blocks, the present invention duplicates only 
the Inode that describes the inode file. Thus, the actual disk space required for a 
snapshot is only the 128 bytes used to store the duplicated inode. The 128 bytes 
of the present invention required for a snapshot is significantiy less than the many 
megabytes used for a clone in the prior art. 
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The present Invention prevents new data written to the active file system 
from ovenA^riting "old" data that Is part of a snapshot(s). It is necessary that old 
data not be ovenA^rltten as long as it Is part of a snapshot. This Is accomplished 
by using a multi-bit free-block map. Most prior art file systems use a free block 
map having a single bit per block to Indicate whether or not a block Is allocated. 
The present Invention uses a block map having 32-blt entries. A first bit Indicates 
whether a block Is used by the active file system, and 20 remaining bits are used 
for up to 20 snapshots, however, some bits of the 31 bits may be used for other 
purposes. 
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BRIEF DESCRIPTION OF THE DRAWINGS 



Figure 1 1s a block diagram of a prior art "clone" of a file system 



Figure 2 is a diagram illustrating a list of inodes having dirty buffers, 



Figure 3 is a diagram illustrating an on-disk inode of WAFL. 



9 Figures 4A-4D are diagrams illustrating on-disk inodes of WAFL having 



Sio different levels of indirection. 



12 Figure 5 is a flow diagram illustrating the method for generating a consis- 

13 tency point. 



15 Figure 6 is a flow diagram illustrating step 530 of Figure 5 for generating a 

16 consistency point. 

17 

18 Figure 7 is a flow diagram illustrating step 530 of Figure 5 for creating a 

19 snapshot. 

20 

21 Figure 8 is a diagram illustrating an incore inode of WAFL according to the 

22 present invention. 

23 
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1 Figures 9A-9D are diagrams illustrating incore inodes of WAFL having 

2 different levels of indirection according to the present invention. 



11 



1^ if 



14 
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Figure 10 is a diagram illustrating an incore inode 1020 for a file. 



6 Figures 1 1 A-1 1 0 are diagrams illustrating a block map (bikmap) file 

7 according to the present invention. 

8 

9 Figure 12 is a diagram illusfrating an Inode file according to the present 



10 invention. 



J{ii2 Figures 13A-13B are diagrams illustrating an Inode map (Inomap) file 



13 according to the present invention. 



15 Figure 14 is a diagram illustrating a directory according to the present 

16 Invention. 

17 

18 Figure 15 is a diagram illustrating a file system infonnation (fsinfo) 

19 structure. 

20 

21 Figure 16 is a diagram illustrating the WAFL file system. 

22 

23 Figures 17 A-I7L are diagrams illustrating the generation of a consis- 
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1 tency point. 



Figures 18A-18C are diagrams illustrating generation of a snapshot. 



Figure 19 is a diagram illustrating changes to an inode file. 



7 Figure 20 {comprising figures 20A, 20B. and 20C) ,is a diagram 11- 

8 lustrating fsinfo blocks used for maintaining a file system in a consistent state. 



10 Figures 21 A-2 1 F are detailed diagrams illustrating generations of a 

11 snapshot. 



ri|i2 Figure 22 is a diagram illustrating an active WAFL file system hav- 

nil* 



13 ing three snapshots that each reference a common file; and, 



j'J 15 Figures 23A-23B are diagrams Illustrating the updating of a time. 



16 

17 DETAILED DESCRIPTION OF THE PRESENT INVENTION 

18 

19 A system for creating read-only copies of a file system is described. In the 

20 following description, numerous specific details, such as number and nature of 

21 disks, disk block sizes, etc., are described in detail in order to provide a more 

22 thorough description of the present invention. It will be apparent, however, to one 

23 skilled in the art, that the present invention may be practiced without these spe- 
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1 cific details. In other instances, well-known features have not been described in 

2 detail so as not to unnecessarily obscure the present invention. 



a ' j'i" 



s -,1 

> + 

' v^l' - 



WRITE ANYWHERE FILE-SYSTEM LAYOUT 



6 The present invention uses a Write Anywhere File-System Layout (WAFL). 

7 This disk fomnat system is block based (i.e., 4 KB blocks that have no fragments), 

8 uses inodes to describe its files, and includes directories that are simply specially 

9 formatted files. WAFL uses files to store meta-data that describes the layout of 

10 the file system. WAFL meta-data files include: an inode file, a block map 

11 (bikmap) file, and an inode map (inomap) file. The inode file contains the inode 

12 table for the file system. The bikmap file indicates which disk blocks are allo- 

13 cated. The inomap file indicates which inodes are allocated. On-disk and Incore 

14 WAFL inode distinctions are discussed below. 



16 On-Disk WAFL Inodes 



17 



18 WAFL inodes are distinct from prior art inodes. Each on-disk WAFL inode 

19 points to 16 blocks having the same level of indirection. A block number is 4- 

20 bytes long. Use of block numbers having the same level of indirection in an Inode 

21 better facilitates recursive processing of a file. Figure 3 is a block diagram illus- 

22 trating an on-disk inode 310. The on-disk inode 310 is comprised of standard 

23 inode information 31 OA and 16 block number entries 31 OB having the same level 
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1 of indirection. The inode infonnation 310A comprises information about the 

2 owner of a file, permissions, file size, access time, etc. that are well-known to a 

3 person skilled in the art. On-disk inode 310 is unlike prior art inodes that com- 

4 prise a plurality of block numbers having different levels of indirection. Keeping 

5 all block number entries 31 OB in an inode 310 at the same level of indirection 

6 simplifies file system implementation. 



"■•ft 



8 For a small file having a size of 64 bytes or less, data is stored directly in 

9 the Inode itself instead of the 16 block numbers. Figure 4A is a diagram illustrat- 

10 ing a Level 0 inode 410 that is similar to inode 310 shown in Figure 3. However, 

11 inode 410 comprises 64-bytes of data 410B instead of 16 block numbers 310B. 

12 Therefore, disk blocks do not need to be allocated for very small files. 



14 For a file having a size of less than 64 KB, each of the 16 block numbers 

15 directly references a 4 KB data block. Figure 4B is a diagram illustrating a Level 

16 1 inode 310 comprising 16 block numbers 31 OB. The block number entries 0-15 

17 point to corresponding 4 KB data blocks 420A-42OC. 

18 

19 For a file having a size that is greater than or equal to 64 KB and is less 

20 than 64 MB, each of the 16 block numbers references a single-indirect block. In 

21 turn, each 4 KB single-indirect block comprises 1024 block numbers that refer- 

22 ence 4 KB data blocks. Figure 4C is a diagram illustrating a Level 2 inode 310 

23 comprising 16 block numbers 31 OB that reference 16 single-indirect blocks 430A- 
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430C. As shown in Figure 4C, block number entry 0 points to single-indirect 
block 430A. Single-indirect block 430A comprises 1024 block numbers that ref- 
erence 4 KB data blocks 440A-440C. Similarly, single-indirect blocks 430B- 
430C can each address up to 1 024 data blocks. 

For a file size greater than 64 MB, the 16 block numbers of the inode 
reference double-indirect blocks. Each 4 KB double-Indirect block comprises 
1024 block numbers pointing to corresponding single-indirect blocks. In turn, 
each single-indirect block comprises 1024 block numbers that point to 4 KB data 
blocks. Thus, up to 64 GB can be addressed. Figure 4D is a diagram illustrating 
a Level 3 inode 310 comprising 16 block numbers 31 OB wherein block number 
entries 0, 1, and 15 reference double-indirect bk>cks 470A, 470B, and 470C, re- 

* 

spectlvely. Double-indirect block 470A comprises 1024 block number entries 0- 
1023 that point to 1024 single-indirect blocks 480A-480B. Each single-indirect 
block 480A-480B, in turn, references 1024 data blocks. As shown In Figure 4D, 
single-indirect block 480A references 1024 data blocks 490A-490C and single- 
indirect block 480B references 1024 data blocks 49OC-490F. 

18 

19 Incore WAFL I nodes 

20 

21 Figure 8 is a block diagram illustrating an incore WAFL inode 820. The in- 

22 core inode 820 comprises the information of on-disk inode 310 (shown in Figure 

23 3), a WAFL buffer data structure 820A, and 16 buffer pointers 820B. A WAFL in- 
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1 buffers. When all dirty buffers referenced by an Inode are either scheduled to be 

2 written to disk or are written to disl<, the count of dirty buffers to inode 1010 is set 

3 to zero. The inode 1010 is then requeued according to its flag (i.e., no dirty buff- 

4 ers). This inode 1010 is cleared before the next inode is processed. Further the 

5 flag of the inode Indicating that it is in a consistency point is cleared. The Inode 

6 1010 itself Is written to disk in a consistency point. 

7 

8 The WAFL buffer structure Is illustrated by indirect WAFL buffer 1020. 

0 9 WAFL buffer 1020 comprises a WAFL buffer data structure 1020A, a 4 KB buffer 

1 10 1020B comprising 1024 WAFL buffer pointers and a 4 KB buffer 1020C compris- 

Ing 1024 on-disk block numbers. The WAFL buffer data structure is 56 bytes in 

S li 

™ 

K 12 size and comprises 2 pointers. One pointer of WAFL buffer data structure 1020A 

il 11!? 

■X 

C1 13 references 4 KB buffer 1020B and a second pointer references buffer 1020C. In 

C1 14 Figure 10, the 16 buffer pointers 1010C of WAFL inode 1010 point to the 16 sin- 

■ lit 

5f 15 gle-indirect WAFL buffers 1020-1024. In turn, WAFL buffer 1020 references 

16 1024 direct WAFL buffer structures 1030-1034. WAFL buffer 1030 is representa- 

17 tive direct WAFL buffers. 

18 

19 Direct WAFL buffer 1030 comprises WAFL buffer data structure 1030A and 

20 a 4 KB direct buffer 1030B containing a cached version of a corresponding on- 

21 disk 4 KB data block. Direct WAFL buffer 1030 does not comprise a 4 KB buffer 

22 such as buffer 1 020C of indirect WAFL buffer 1 020. The second buffer pointer of 

23 WAFL buffer data structure 1030A is zeroed, and therefore does not point to a 
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1 second 4 KB buffer. This prevents inefficient use of memory because memory 

2 space would be assigned for an unused buffer othenwise. 



4 In the WAFL fiie system as shown in Figure 10, a WAFL in-core inode 

5 structure 1010 references a tree of WAFL buffer structures 1020-1024 and 1030- 

6 1 034. it is similar to a tree of blocks on disk referenced by standard inodes com- 

7 prising block numbers that pointing to Indirect and/or direct blocks. Thus, WAFL 

8 inode 1010 contains not only the on-disk Inode 1010D comprising 16 volume 

9 block numbers, but also comprises 16 buffer pointers 1010C pointing to WAFL 



3 10 buffer structures 1020-1024 and 1030-1034. WAFL buffers 1030-1034 contain 



% 11 cached contents of blocks referenced by volunfie block numbers, 



Q 13 The WAFL in-code inode 1010 contains 16 buffer pointers 1010C. In turn, 



14 the 16 buffer pointers 1010C are referenced by a WAFL buffer structure 101 OB 

15 that roots the tree of WAFL buffers 1020-1024 and 1030-1034. Thus, each 

16 WAFL Inode 1010 contains a WAFL buffer structure 1010B that points to the 16 

17 buffer pointers 1010C in the inode 1010. This facilitates algorithms for handling 

18 trees of buffers that are implemented recursively. If the 16 buffer pointers 101OC 

19 In the inode 1010 were not represented by a WAFL buffer structure 101 OB, the 

20 recursive algorithms for operating on an entire tree of buffers 1020-1024 and 

21 1 030-1 034 would be difficult to implement. 

22 

23 Figures 9A-9D are diagrams illustrating inodes having different levels of in- 
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1 direction. In Figures 9A-9D, simplified indirect and direct WAFL buffers are illus- 

2 trated to show indirection. However, it should be understood that the WAFL buff- 

3 ers of Figure 9 represent corresiponding indirect and direct buffers of Figure 10. 

4 For a small file having a size of 64 bytes or less, data is stored directly in tiie 

5 inode itself instead of the 1 6 buffer pointers. Figure 9A is a diagram illustrating a 

6 Level 0 inode 820 that is the same as inode 820 shown in Figure 8 except that 

7 inode 820 comprises 64-bytes of data 920B instead of 16 buffer pointers 820B. 

8 Therefore, additional buffers are not allocated for very small files. 



#10 For a file having a size of less than 64 KB, each of the 16 buffer pointers di- 



f.p. 

■ r ■ 



11 rectiy references a 4 KB direct WAFL buffer. Figure 9B is a diagram illustrating a 



If 12 Level I inode 820 comprising 16 buffer pointers 820B. The buffer pointers PTRO- 



it '!;?!^ 

s ■■ -i; 

in 



13 PTR1 5 point to corresponding 4 KB direct WAFL buffers 922A-922C. 



14 



15 For a file having a size that is greater than or equal to 64 KB and is less 

16 than 64 MB, each of the 16 buffer pointers references a single-indirect WAFL 

17 buffer. In turn, each 4 KB single-indirect WAFL buffer comprises 1024 buffer 

18 pointers that reference 4 KB direct WAFL buffers. Figure 9C is a diagram illus- 

19 trating a Level 2 inode 820 comprising 16 buffer pointers 820B that reference 16 

20 single-Indirect WAFL buffers 930A-930C. As shown In Figure 9C, buffer pointer 

21 PTRO points to single-indirect WAFL buffer 930A. 

22 Single-indirect WAFL buffer 930A comprises 1024 pointers that reference 
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1 4 KB direct WAFL buffers 940A-940C. Similarly, single-indirect WAFL buffers 

2 930B-930C can each address up to 1 024 direct WAFL buffers. 



rfF^ It 

- ^^f^iPt' 

r 

J*' »f 



4 For a file size greater than 64 MB, the 16 buffer pointers of the inode refer- 

5 ence doubfe-indirect WAFL buffers. Each 4 KB double-indirect WAFL buffer 

6 comprises 1024 pointers pointing to corresponding single-Indirect WAFL buffere. 

7 In turn, each single-indirect WAFL buffer comprises 1024 pointers that point to 

8 4KB direct WAFL buffers. Thus, up to 64 GB can be addressed. Figure 9D is a 

9 diagram illustrating a Level 3 Inode 820 comprising 16 pointers 820B wherein 

10 pointers PTRO, PTR1 , and PTR15 reference double-Indirect WAFL buffers 970A, 

11 970B, and 970C, respectively. 

12 Double-indirect WAFL buffer 970A comprises 1024 pointers that point to 1024 

13 single-indirect WAFL buffers 980A-980B. Each single-indirect WAFL buffer 

14 980A-980B, in turn, references 1024 direct WAFL buffers. As shown In Figure 



S 15 9D, single-indirect WAFL buffer 980A references 1 024 direct WAFL buffers 990A- 

'<* Si^'i 

16 990C and single-indirect WAFL buffer 980B references 1024 direct WAFL buffers 

17 990D-990F. 

18 
19 
20 
21 
22 
23 
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1 Directories 



3 Directories in the WAFL system are stored in 4 KB blocks tliat are 

4 divided into two sections. Figure 14 is a diagram illustrating a directory block 

5 1410 according to the present invention. Each directory block 1410 comprises a 

6 first section 1410A comprising fixed length directory entry structures 1412-1414 

7 and a second section 1410B containing the actual directory names 1416-1418. 

8 Each directory entry also contains a file id and a generation. This information 
O 9 identifies what file the entry references. This infomnation is well-known In the art, 



* 10 and therefore Is not illustrated in Figure 14. Each entry 1412-1414 in the first 



^ ^ ^ 



1 11 section 141 OA of the directory block has a pointer to its name in the second sec- 



12 tion 1410B. Further, each entry 1412-1414 includes a hash value dependent 

13 upon its name in the second section 141 OB so that the name is examined only 

14 when a hash hit (a hash match) occurs. For example, entry 1412 of the first sec- 

15 tion 1410A comprises a hash value 1412A and a pointer 1412B. The hash value 

16 1412A is a value dependent upon the directory name "DIRECTORY_ABC" 

17 stored In variable length entry 1416 of the second section 1410B. Pointer 1412B 

18 of entry 1410 points to the variable length entry 1416 of second section 1410B. 

19 Using fixed length directory entries 1412-1414 in the first Section 1410A speeds 

20 up the process of name lookup. A calculation is not required to find the next entry 

21 in a directory block 1410. Further, keeping entries 1412-1414 in the first section 

22 small 141 OA improves the hit rate for file systems with a line-fill data cache. 

23 
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1 Meta-Data 



3 WAFL keeps information that describes a file system in files known as 

4 meta-data. Meta-data comprises an inode file, inomap file, and a bikmap file. 

5 WAFL stores its meta-data in files that may be written anywhere on a disk. 

6 Because all WAFL meta-data is kept in files, it can be written to any location just 

7 like any other file in the file system. 

8 

Q 9 A first meta-data file is the "inode file" that contains inodes describing, all 

i§. 10 other files in the file system. Figure 12 is a diagram illustrating an inode file 1210. 

if 

■ HI* n 

11 The inode file 1210 may t>e written anywhere on a disk unlike prior art systems 

12 that write "Inode tables" to a fixed location on disk. The inode file 1210 contains 

13 an inode 1210A-1210F for each file in the file system except for the inode file 

14 1210 itself. The inode file 1210 Is pointed to by an inode referred to as the "root 

15 inode". The root inode Is kept in a fixed location on disk referred to as the file 

16 system information (fsinfo) block described below. The inode file 1210 itself is 

17 stored In 4 KB blocks on disk (or 4 KB buffers In memory). Figure 12 illustrates 

18 that inodes 1210A-1210C are stored in a 4 KB buffer 1220. For on-dlsk inode 

19 sizes of 128 bytes, a 4 KB buffer (or block) comprises 32 inodes. The incore 

20 inode file 1210 is composed of WAFL buffers 1220. When an incore inode (i.e., 

21 820) is loaded, the on-disk inode part of the incore inode 820 is copied from the 

22 buffer 1220 of the inode !lie 1210. The buffer data itself is loaded from disk. 

23 Writing data to disk is done in the reverse order. The incore inode 820, which 
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1 contains a copy of the ondisk inode, is copied to the con-esponding buffer 1220 of 

2 the inode file 1210. Then, the inode file 1210 is write-allocated, and the data 

3 stored in the buffer 1220 of the inode file 1210 is written to disk. 



5 Anoflier meta-data file is the "block map" (bikmap) file. Figure 1 1 A is a 

6 diagram illustrating a bikmap file 1110. The bikmap file 1110 contains a 32-bit 

7 entry 1110A-1110D for each 4 KB block In the disk system. It also serves as a 

8 free-block map file. The bikmap file 1110 indicates whether or not a disk block 

9 has been allocated. Figure 11B is a diagram of a block entry 1110A of bikmap 



5 11 

■■'rp r? 



1 10 file 1110 (shown in Figure 11 A). As shown in Figure 11B, entry 1110A Is com- 



11 prised of 32 bits (BITO-BIT31 ). Bit 0 (BITO) of entry 1 1 10A is the active file sys- 

12 tern bit (F5-BIT). The FS-bit of entry 1110A indicates whether or not the corre- 

13 spending block is part of the active file system. Bits 1-20 (BIT1-BIT20) of entry 

14 1 1 1 0A are bits that indicate whether the block is part of a corresponding snapshot 

15 1-20. The next upper 10 bits (BIT21-BIT30) are reserved. Bit 31 (BIT3I) is the 

16 consistency point bit (CP-BIT) of entry 111 OA. 

17 

18 A block is available as a free block in the file system when all bits (BITO- 

19 BIT31) in the 32-bit entry 1110A for the block are clear {reset to a value of 0). 

20 Figure 1 1C is a diagram illustrating entry 1 1 10A of Figure 1 1A indicating the disk 

21 block is free. Thus, the block referenced by entry 1110A of bikmap file 1110 is 

22 free when bits 0-31 (BIT0-BIT31 ) all have values of 0. 

23 Figure 1 1D is a diagram illustrating entry 1 1 10A of Figure 1 1A indicating an 



Express Mailing EL 734 815 746 US 



24 



103.1002.12 



1 allocated block In the active file system. When bit 0 (BITO), also referred to as 

2 the FS-bit, is set to a value of 1, the entry 1110A of bikmap file 1110 indicates a 

3 block that Is part of the active file system. Bits 1-20 (BIT1-BIT20) are used to in- 

4 dicate con-esponding snapshots, if any, that reference the block. Snapshots are 

5 described in detail below. If bit 0 (BITO) is set to a value of 0, this does not nec- 

6 essarily indicate that the block is available for allocation. All the snapshot bits 

7 must also be zero for the block to be allocated. Bit 31 (BIT31) of entry 1 1 10A al- 

8 ways has the same state as bit 0 (BITO) on disk, however, when loaded into 
O 9 memory bit 31 (BIT31 ) is used for bookkeeping as part of a consistency point. 



0 13 file 1310 contains an 8-bit entry 1310A-1310C for each block in the inode file 
p 14 1210 shown in Figure 12. Each entry 1310A-131OC is a count of allocated ino- 



If 15 des in the conresponding block of the inode file 1210. Figure 13A shows values 

■r flr 

16 of 32, 5, and 0 in entries 1310A-1310C, respectively. The inode file 1210 must 

17 still be inspected to find which inodes in the block are free, but does not require 

18 large numbers of random blocks to be loaded into memory from disk. Since each 

19 4 KB block 1220 of inode file 1210 holds 32 inodes, the 8-bit inomap entry 

20 1310A-1310C for each block of inode file 1210 can have values ranging from 0 to 

21 32. When a block 1220 of an inode file 1210 has no inodes in use, the entry 

22 1310A-1310C for it in inomap file 1310 is 0. When all the inodes in the block 
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Another meta-data file is the "inode map" (inomap) file that serves as a free 
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1 1220 mode file 1210 are in use, the entry 1310A-1310C of the inomap file 1310 



2 has a value of 32. 



3 



4 Figure 13B is a diagram illustrating an inomap file 1350 that references 



5 the 4 KB blocks 1340A-1340C of inode file 1340. For example, inode file 1340 



6 stores 37 inodes in three 4 KB blocks 1340A-1340C. Blocks 1340A-1340C of 



7 inode file 1340 contain 32, 5, and 0 used inodes, respectively. Entries 1350A- 



8 1350C of bikmap file 1350 reference blocks 1340A-1340C of inode file 1340, re- 



fi 9 spectively. Thus, the entries 1350A-1350C of inomap file have values of 32, 5, 
10 and 0 for blocks 1340A-1340C of Inode file 1340. In tum. entries 1350A-1350C 



11 of momap file indicate 0, 27, and 32 free inodes in blocks 1340A-1340C of inode 

12 file 1340, respectively. 



C1 14 Referring to Figure 13, using a bitmap for the entries 1310A-1310C of 

9 15 inomap file 1310 instead of counts Is disadvantageous since it would require 4 

zi. is 

16 bytes per entry 1 31 OA-1 31 OC for block 1 220 of the inode file 1 21 0 (shown in 



17 Figure 12) instead of one byte. Free inodes in the block(s) 1220 of the inode 



18 file 1210 do not need to be indicated in the inomap file 1310 because the inodes 



19 themselves contain that infonnation. 



20 



21 Figure 15 is a diagram illustrating a file system information (fsinfo) structure 



22 1510. The root inode 151 OB of a file system is kept in a fixed location on disk so 



23 that it can be located during booting of the file system. The fsinfo block is not a 
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1 meta-data file but Is part of the WAFL system. The root Inode 151 OB is an Inode 

2 referencing the Inode file 1210. It is part of the file system Information (fslnfo) 

3 structure 1510 tfiat also contains Information 1510A including the number of 

4 blocks In the file syst«=*m, the creation time of the file system, etc. The miscella- 

5 neous Information 151 OA further comprises a checksum 1510C (described be- 

6 low). Except for the root Inode 151 OB itself, this information 151 OA can be kept in 

7 a meta-data file in an altemate embodiment. Two identical copies of the fslnfo 

8 structure 1510 are kept In fixed locations on disk. 



9 



•l ~ ? 

u ^ v< 



10 Figure 16 is a diagram illustrating the WAFL file system 1670 in a consis- 

11 tent state on disk comprising two fslnfo blocks 1610 and 1612, Inode file 1620, 

12 bikmap file 1630, inomap file 1640, root directory 1650, and a typical file (or dl- 

13 rectory) 1660. Inode file 1620 is comprised of a plurality of Inodes 1620A-1620D 



0 14 that reference other files 1630-1660 In the file system 1670. Inode 1620A of 



'-'i 



15 inode file 1620 references bikmap file 1630. Inode 1620B references inomap file 

16 1640. Inode 1620C references root directory 1650. Inode 1620D references a 

17 typical file (or directory) 1660. Thus, the inode file points to ail files 1630-1660 In 

18 the file system 1670 except for fslnfo blocks 1610 and 1612. Fslnfo blocks 1610 

19 and 1612 each contain a copy 1610B and 1612B of the inode of the inode file 

20 1620, respectively. Because the root inode 161 OB and 1612B of fslnfo blocks 

21 1610 and 1612 describes the inode file 1620, that in turn describes the rest of the 

22 files 1630-1660 in the file system 1670 including all meta-data files 1630-1640, 

23 the root inode 161 OB and 1612B is viewed as-the root of a tree of blocks. The 
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1 WAFL system 1670 uses this tree structure for its update method (consistency 

2 point) and for implementing snapshots, both described below. 



^ 'A 



f^'^ 13 

rf *\ 



■ I* *\ 



List of Inodes Having Dirty Blocks 



6 WAFL in-core inodes (i.e., WAFL inode 1010 shown in Figure 10) of the 

7 WAFL file system are maintained in different linked lists according to their status. 

8 Inodes that reference dirty blocks are kept in a dirty inode list as shown in Figure 

9 2. Inodes containing valid data that is not dirty are kept in a separate list and ino- 



0 10 des that have no valid data are kept in yet another, as is well-known in the art, 



11 The present invention utilizes a list of Inodes having dirty data blocks that facili- 

12 tates finding all of the inodes that need write allocations to be done. 



14 Figure 2 is a diagram illustrating a list 210 of dirty inodes according to the 

15 present invention. The list 210 of dirty Inodes comprises WAFL in-core inodes 

16 220-1750. As shown In Figure 2, each WAFL in-core inode 220-250 comprises a 

17 pointer 220A-250A, respectively, that points to another inode in the linked list. For 

18 example, WAFL inodes 220-250 are stored In memory at locations 2048, 2152, 

19 2878, 3448 and 3712, respectively. Thus, pointer 220A of inode 220 contains 

20 address 2152. It points therefore to WAFL Inode 222. In turn, WAFL Inode 222 

21 points to WAFL inode 230 using address 2878. WAFL inode 230 points to WAFL 

22 Inode 240. WAFL inode 240 points to Inode 250. The pointer 250A of WAFL 

23 Inode 250 contains a null value and therefore does not point to another inode. 
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1 Thus, it is the last inode in the list 210 of dirty inodes. Each inode in the list 210 

2 represents a file comprising a tree of buffers as depicted in Figure 10. At least 

3 one of the buffers referenced by each inode 220-250 is a dirty buffer. A dirty 

4 buffer contains modified data that must be written to a new disk location in the 

5 WAFL system. WAFL always writes dirty buffers to new locations on disk. 



7 
8 



Tt « H 

fir 

I III i« 
I* r* i» 



CONSISTENCY POINTS 



9 The WAFL disk structure described so far is static. In the present inven- 

10 tion, changes to the file system 1670 are tightly controlled to maintain the file 



11 system 1670 in a consistent state. The file system 1670 progresses from one 

12 self-consistent state to another self-consistent state. The set (or tree) of self- 

13 consistent blocks on disk that is rooted by the root inode 151 OB is referred to as a 
consistency point (CP). To implement consistency points. WAFL always writes 



15 new data to unallocated blocks on disk. It never overwrites existing data. Thus, 

tr - 

B 16 as long as the root inode 1510B is not updated, the state of the file system 1670 

17 represented on disk does not change. However, for a file system 1670 to be 

18 useful, it must eventually refer to newly written data, therefore a new consistency 

19 point must be written. 

20 

21 Referring to Figure 16, a new consistency point is written by first flushing all 

22 file system blocks to new locations on disk (including the blocks in meta-data files 



23 



such as the inode file 1620, bikmap file 1630, and inomap file 1640). A new root 
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1 inode 161 OB and 1612B for the file system 1670 is then written to 6\sk. With this 

2 method for atomically updating a file system, the on-disk file system is never in- 

3 consistent. The on-disk file system 1670 reflects an old consistency point up until 

4 the root inode 161 OB and 1612B is written. Immediately after the root Inode 

5 1610B and 1612B is written to disk, the file system 1670 reflects a new consis- 

6 tency point. Data structures of the file system 1670 can be updated in any order. 

7 and there are no ordering constraints on disk writes except the one requirement 
6 that all blocks in the file system 1670 must be written to disk before the root inode 



9 1610B and 1612B is updated. 



10 



S 11 To convert to a new consistency point, the root inode 161 OB and 1612B 

I" 

1 12 must be updated reliably and atomically. WAFL does this by keeping two identi- 
cal copies of the fsinfo structure 1610 and 1612 containing the root inode 161 OB 
and 1612B. During updating of the root inode 1610B and 1612B, a first copy of 

15 the fsinfo structure 1 61 0 is written to disk, and then the second copy of the fsinfo 

16 structure 1612 is written. A checksum 1610C and 1612C in the fsinfo structure 



ftr- 



17 1610 and 1612, respectively, is used to detect the occurrence of a system crash 

18 that corrupts one of the copies of the fsinfo structure 1610 or 1612, each con- 

19 taining a copy of the root inode, as it is being written to disk. Normally, the two 

20 fsinfo structures 1 61 0 and 1 61 2 are identical. 

21 
22 
23 
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2 



Algorithm for Generating a Consistency Point 

3 Figure 5 is a diagram illustrating the method of producing a consistency 

4 point. In step 510, all "dirty" inodes (inodes that point to new blocks containing 

5 modified data) in the system are marked as being in the consistency point. Their 

6 contents, and only their contents, are written to disk. Only when those writes are 

7 complete are any writes from other inodes allowed to reach disk. Further, during 

8 the time dirty writes are occurring, no new modifications can be made to inodes 

9 that have their consistency point flag set. 



if ? 

J!^ 10 



fl! 



PI 14 



it-- 



19 



20 



In addition to setting the consistency point flag for all dirty inodes that are 



i 12 part of the consistency point, a global consistency point flag is set so that user- 
is requested changes behave in a tightly controlled manner. Once the global con- 
sistency point flag is set, user-requested changes are not allowed to affect inodes 
that have their consistency point flag set. Further, only inodes having a consis- 



16 tency point flag that is set are allocated disk space for their dirty blocks. Conse- 

17 quently, the state of the file system will be flushed to disk exactly as it was when 

18 the consistency point began. 



In step 520, regular files are flushed to disk. Flushing regular files com- 

21 prises the steps of allocating disk space for dirty blocks in the regular files, and 

22 writing the con-esponding WAFL buffers to disk. The inodes themselves are then 

23 flushed (copied) to the inode file. All inodes that need to be written are in either 
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1 the list of inodes having dirty buffers or the list of inodes that are dirty but do not 



2 have dirty buffers. When step 520 is completed, there are no more ordinary ino- 



3 des with the consistency point flag set, and ail incoming I/O requests succeed 



4 unless the requests use buffers that are still locked up for disk I/O operations. 



5 



6 in step 530, special files are flushed to disk. Flushing special files 



7 comprises the steps of allocating disk space for dirty blocks in the two special 



8 files: the inode file and the bikmap file, updating the consistency bit (CP-bit) to 



CI 9 match the active file system bit (FS-bit) for each entry in the bikmap file, and then 

§ 10 writing the blocks to disk. Write allocating the inode file and the bikmap is com- 

% 11 plicated because the process of write allocating them changes the files them- 

lH 12 selves. Thus, in step 530 writes are disabled whWe changing fiiese files to pre- 

P*lt ■ 

O 13 vent important blocks from locking up in disk I/O operations before the changes 

O 14 are completed. 

H 

16 Also, in step 530, the creation and deletion of snapshots, described below, 



17 are performed because it is the only point in time when the file system, except for 



18 the fsinfo block, Is completely self consistent and about to be written to disk. A 



19 snapshot is deleted from the file system before a new one is created so that the 



20 same snapshot inode can be used in one pass. 



21 



22 Figure 6 is a flow diagram illustrating the steps that step 530 comprises. 



23 Step 530 allocates disk space for the bikmap file and the inode file and copies the 
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1 active FS-bit into the CP-bit for each entry in the bikmap file. In step 610, the 



2 inode for the bll<map file is pre-flushed to the inode file. This ensures that the 



3 block In the inode file that contains the inode of the bikn :ap file is dirty so that step 



4 620 allocates disk space for it. 



5 



6 In step 620, disk space is allocated for all dirty blocks in the inode and 



7 bikmap files. The dirty blocks include the block in the Inode file containing the 



8 inode of the bikmap file. 



ST ^ 

3 10 In step 630, the inode for the bikmap file is flushed again, however this time 

'St '< 

|S 11 the actual Inode Is written to the pre-flushed block in the inode file. 

Step 610 has already dirtied the block of the Inode file that contains the inode of 
h 13 the bikmap file. Thus, another write-allocate, as in step 620, does not need to be 

XT ? 

is 

0 14 scheduled. 

■ H 

W 15 

5 ; Is 



16 In step 640, the entries for each block in the bSkmap file are updated. Each 

17 entry is updated by copying the active FS-bit to the CP-bit (i.e., copying bit 0 into 

18 bit 31 ) for all entries in dirty blocks in the bikmap file. 

19 

20 In step 650, all dirty blocks in the bikmap and inode files are written to disk. 

21 Only entries in dirty blocks of the bikmap file need to have the active file system 

22 bit (FS-bit) copied to the consistency point bit (CP-bit) in step 640. 
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1 Immediately after a consistency point, ail bikmap entries have same value for 

2 both the active FS-bit and CP-bit. As time progresses, some active FS-bits of 

3 bikmap file entries for the file system are either cleared or set. The blocks of the 

4 bikmap file containing the changed FS-bits are accordingly marked dirty. During 

5 the following consistency point, blocks that are clean do not need to be re-copled. 

6 The clean blocks are not copied because they were not dirty at the previous con- 

7 sistency point and nothing in the blocks has changed since then. Thus, as long 

8 as the file system is initially created with the active FS-bit and the CP-bit having 
ei 9 the same value in all bikmap entries, only entries with dirty blocks need to be up- 
I1 10 dated at each consistency point. 

J Z 1^ 

" 

S 12 Refemng to Figure 5, in step 540, the file system information (fslnfo) block 

Q 13 is first updated and then fiushed to disk. The fslnfo block is updated by writing a 

J ? ;t 

O 14 new root inode for the inode file into it. The fsinfo block Is written twice. It Is first 

■ 

If 15 written to one location and then to a second location. The two writes are per- 

-'r rtf* 

16 formed so that when a system crash occurs during either write, a self^conslstent 

17 file system exists on disk. Therefore, either the new consistency point is avail- 

18 able if the system crashed while writing the second fsinfo block or the previous 

19 consistency point (on disk before the recent consistency point began) is available 

20 if the first fsinfo block failed. When the file system is restarted after a system fail- 

21 ure, the highest generation count for a consistency point in the fsinfo blocks hav- 

22 ing a correct checksum value is used. This is described in detail below. 

23 
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In step 550, the consistency point is completed. This requires that any dirty 
inodes that were delayed because they were not part of the consistency point be 
requeued. Any inodes that had their state change during the consistency point 
are in the consistency point wait {CP_WAIT) queue. The CP_WAIT queue holds 
inodes that changed before step 540 completed, but after step 510 when the 
consistency point started. Once the consistency point is completed, the inodes in 
the CP_WAIT queue are re-queued accordingly In the regular list of inodes with 
dirty buffers and list of dirty inodes without dirty buffers. 
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20 



21 



23 



Single Ordering Constraint of Consistency Point 



3 The present invention, as illustrated in Figures 20A-20C, has a single or- 

4 dering constraint. The single ordering constraint is that the fsinfo block 1810 Is 

5 written to disk only after all the other blocks are written to disk. The writing of the 

6 fsinfo block 1810 is atomic, otherwise the entire file system 1830 could be lost. 

7 Thus, the WAFL file system requires the fsinfo block 1810 to be written at once 

8 and not be in an inconsistent state. As illustrated in Figure 15, each of the fsinfo 
blocks 1810 (1510) contains a checksum 1510C and a generation count 1510D. 



Figure 20A illustrates the updating of the generation count 1810D and 
1870D of fsinfo blocks 1810 and 1870. Each time a consistency point (or snap- 
is shot) is perfomied, the generation count of the fsinfo block is updated. Figure 
20A illustrates two fsinfo blocks 1810 and 1870 having generation counts 1810D 



15 and 1870D, respectively, that have the same value of N indicating a consistency 

16 point for the file system. Both fsinfo blocks reference the previous consistency 

17 point (old file system on disk) 1830. A new version of the file system exists on 

18 disk and is refen-ed to as new consistency point 1831. The generation count is 



19 incremented every consistency point. 



In Figure 20B, the generation count 1810D of the first fsinfo block 1810 is 
22 updated and given a value of N+1 . It is then written to disk. Figure 20B illus- 



trates a value of N+1 for generation count 1810D of fsinfo block 1810 whereas 
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1 the generation count 1870D of the second fslnfo block 1870 has a value of N. 

2 Fslnfo block 1810 references new consistency point 1831 whereas fslnfo block 

3 1870 references old consistency point 1830. Next, the generation count 1870D 

4 of fslnfo block 1870 Is updated and written to disk as Illustrated In Figure 20C. In 

5 Figure 20C, the generation count 1870D of fslnfo block 1870 has a value of N+1. 

6 Therefore the two fslnfo blocks 1810 and 1870 have the same generation count 

7 value of N+1. 

8 

9 When a system crash occurs between fslnfo block updates, each copy of 

10 the fslnfo block 1810 and 1870 will have a self consistent checksum (not shown 



fen In the diagram), but one of the generation numbers 1810D or 1870D will have a 



12 higher value. A system crash occurs when the file system Is in the state lllus- 

13 trated In Figure 20B. For example, In the prefen-ed embodiment of the present 



C1 14 Invention as Illustrated in Figure 20B, the generation count 1810D of fslnfo block 



1.3 



15 1810 is updated before the second fslnfo block 1870. Therefore, the generation 

16 count 1810D (value of one) is greater than the generation count 1870D of fsinfo 

17 block 1870. Because the generation count of the first fsinfo block 1810 Is higher, 

18 it is selected for recovering the file system after a system crash. This is done be- 

19 cause the first fslnfo block 1810 contains more current data as indicated by its 

20 generation count 1810D. For the case when the first fslnfo block is corrupted be- 

21 cause the system crashes while it is being updated, the other copy 1870 of the 

22 fslnfo block is used to recover the file system 1830 into a consistent state. 

23 
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1 It is not possible for both fsinfo blocks 1810 and 1870 to be updated at the 

2 same time in the present invention. Therefore, at least one good copy of the 

3 fsinfo block 1810 and 11^70 exists in the file system. This allows the file system to 

4 always be recovered into a consistent state. 

5 

6 WAFL does not require special recovery procedures. This is unlike prior 

7 art systems that use logging, ordered writes, and mostly ordered writes with re- 

8 covery. This is because only data corruption, which RAID protects against, or 
p 9 software can corrupt a WAFL file system. To avoid losing data when the system 

i'.S'!5 

10 fails, WAFL may keep a non-volatile transaction log of all operations that have 

■> ^ . 

§, 11 occunred since the most recent consistency point. This log is completely inde- 

ff * ^^ 
■ " f ^ 

}|ii2 pendent of the WAFL disk format and is required only to prevent operations from 

C1 13 being lost during a system crash. However, it is not required to maintain consis- 

H ; ii 

O 14 tency of the file system. 

I 

16 Generating A Consistency Point 

17 

18 As described above, changes to the WAFL file system are tightly controlled 

19 to maintain the file system in a consistent state. Figures 17A-I7H illustrate the 

20 generation of a consistency point for a WAFL file system. The generation of a 

21 consistency point is described with reference to Figures 5 and 6. 

22 

23 In Figures 17 A-I7L, buffers that have not been modified do not have 
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1 asterisks beside them. Therefore, buffers contain the same data as correspond- 

2 ing on-disk blocks. Thus, a block may be loaded Into memory but it has not 

3 changed with respect to its on disk version. A buffer with a single asterisk (*) be- 

4 side it indicates a dirty buffer in memory (its data is modified). A buffer with a 

5 double asterisk (**) beside it indicates a dirty buffer that has been allocated disk 

6 space. Finally, a buffer with a triple asterisk (***) is a dirty buffer that is written 

7 Into a new block on disk. This convention for denoting the state of buffers is also 

8 used with respect to Figures 2 1 A-2 1 E. 



11 10 Figure 17A illustrates a list 2390 of inodes with dirty buffers comprising 



J n 



.11 inodes 2306A and 2306B. Inodes 2306A and 2306B reference trees of buffers 



is 



12 where at least one buffer of each free has been modified. Initially, the consis- 

13 tency point flags 2391 and 2392 of inodes 2306A and 2306B are cleared (0). 

14 While a list 2390 of inodes with dirty buffers is illustrated for the present system, it 

15 should be obvious to a person skilled in the art that other lists of inodes may exist 

16 in memory. For instance, a list of inodes that are dirty but do not have dirty buff- 

17 ers is maintained in memory . These inodes must also be marked as being in the 

18 consistency point They must be flushed to disk also to write the dirty contents of 

19 the inode file to disk even though the dirty inodes do not reference dirty blocks. 

20 This is done in step 520 of Figure 5. 

21 

22 Figure 17B is a diagram illustrating a WAFL file system of a previous con- 

23 sistency point comprising fsinfo block 2302, inode file 2346, bikmap file 2344 and 
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1 flies 2340 and 2342. File 2340 comprises blocks 2310-2314 containing data "A", 

2 "B", and "C", respectively. File 2342 comprises data blocks 2316-2320 compris- 

3 ing data "D", "E", and "F", respectively. BIkmap file 2344 comprises block 2324. 

4 The inode file 2346 comprises two 4 KB blocks 2304 and 2306. The second 

5 block 2306 comprises inodes 2306A-2306C that reference file 2340, file 2342, 

6 and bikmap file 2344, respectively. This is illustrated in block 2306 by listing the 

7 file number In the inode. Fsinfo block 2302 comprises the root inode. The root 

8 inode references blocks 2304 and 2306 of inode file 2346. Thus. Figure 17B II- 

9 lustrates a tree of buffers in a file system rooted by the fsinfo block 2302 contain- 



fMO ing the root Inode. 



rtfr 

1 3 

r% 14 



K 12 Figure 17C is a diagram illustrating two modified buffers for blocks 2314 

and 2322 in memory. The active file system is modified so that the block 2314 - 
containing data "C" is deleted from file 2340. Also, the data "F" stored in block 
2320 is modified to "F-prime", and is stored in a buffer for disk block 2322. It 
should be understood that the modified data contained in buffers for disk blocks 
2314 and 2322 exists only in memory at this time. All other blocks in the active 



15 



16 



18 file system in Figure 17C are not modified, and therefore have no asterisks be- 
side them. However, some or all of these blocks may have corresponding clean 



19 



20 buffers in memory. 



21 



22 Figure 17D is a diagram illustrating the entries 2324A-2324M of the bikmap 

23 file 2344 in memory. Entries 2324A-2324M are contained in a buffer for 4 KB 
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1 block 2324 of bikmap file 2344. As described previously, BITO and BIT31 are the 

2 FS-BIT and CP-BIT, respectively. The consistency point bit (CP-BIT) Is set dur- 

3 ing a consistency point to ensure that the corresponding block is not modified 

4 once a consistency point has begun, but not finished. BIT1 is the first snapshot 

5 bit (described below). BIkmap entries 2324A and 2324B Illustrate that, as shown 

6 In Figure 1 7B, the 4 KB blocks 2304 and 2306 of Inode file 2346 are In the active 

7 file system (FS-BIT equal to 1) and in the consistency point (CP-BIT equal to 1). 

8 Similarly, the other blocks 2310-2312 and 2316-2320 and 2324 are in the active 

9 file system and In the consistency point. However, blocks 2308, 2322, and 2326- 
2328 are neither in the active file system nor In the consistency point (as indl- 



>• fir 



11 cated by BITO and BIT31, respectively). The entry for deleted block 2314 has a 

12 value of 0 in the FS-BIT indicating that It has been removed from the active file 



P1 13 system. 



If 



1 14 

if 



15 In step 510 of Figure 5, all "dirty" inodes In the system are marked as being 

16 In the consistency point. Dirty inodes include both inodes that are dirty and ino- 

17 des that reference dirty buffers. Figure 171 illustrates a list of Inodes with dirty 

18 buffers where the consistency point flags 2391 and 2392 of inodes 2306A and 



19 2306B are set (1). Inode 2306A references block 2314 containing data "C" of file 
2340 which is to be deleted from the active file system. Inode 2306B of block 
2306 ot inode file 2346 references file 2342. Block 2320 containing data "F" has 
been modified and a new block containing data "F" must be allocated. This is 11- 



20 



21 



23 lustrated in Figure 17E. 
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1 In step 520, regular files are flushed to disk. Thus, block 2322 is allocated 

2 disk space. Block 2314 of file 2340 Is to be deleted, therefore nothing occurs to 

3 this block until the consistency point is subsequently completed. Block 2322 Is 

4 written to disk in step 520. This Is illustrated in Figure 17F where buffers for 

5 blocks 2322 and 2314 have been written to disk (marked by ***). The intermedi- 

6 ate allocation of disk space (**) Is not shown. The Incore copies of inodes 2308A 

7 and 2308B of block 2308 of inode file 2346 are copied to the inode file. The 

8 modified data exists In memory only, and the buffer 2308 is marked dirty . The in- 
O 9 consistency point flags 2391 and 2392 of inodes 23(^A and 2306B are then 
w 10 cleared (0) as Illustrated in Figure 17A. This releases the inodes for use by other 
!| 11 processes. Inode 2308A of block 2308 references blocks 2310 and 2312 of file 

p* *• M 

M 12 2346. Inode 2308B references blocks 2316, 2318, 2322 for file 2342. As illus- 

p 13 trated in Figure 17F, disk space is allocated for direct block 2322 for file 2342 and 

Ttr^ TV 

H 2 J? ■ 

CI 14 that block is written to disk. However, the file system itself has not been updated. 

15 Thus, the file system remains in a consistent state. 

16 

17 In step 530, the bikmap file 2344 is flushed to disk. This is illustrated in 

18 Figure 17G where the bikmap file 2344 is indicated as being dirty by the asterisk. 

19 

20 In step 610 of Figure 6, the inode for the bikmap file is pre-flushed to the 

21 inode file as illustrated in Figure 17H. Inode 2308C has been flushed to block 

22 230B of inode file 2346. However, inode 2308C still references block 2324. In 

23 step 620, disk space is allocated for bikmap file 2344 and inode file 2346. Block 



Express Mailing EL 734 815 746 US 



42 



103.1002.12 



1 2308 Is allocated for inode file 2346 and block 2326 is allocated for bikmap file 

2 2344. As described above, block 2308 of inode file 2346 contains a pre-flushed 

3 inode 2308C for bikmap file 2344. In step 630, the inode for the bikmap file 2344 

4 is written to the pre-flushed block 2308C in inode 2346. Thus, incore inode 

5 2308C Is updated to reference block 2324 in step 620, and is copied Into the 

6 buffer in memory containing block 2306 that Is to be written to block 2308. This Is 

7 illustrated in Figure 17H where inode 2308C references block 2326. 



8 



315 S 



In step 640, the entries 232^-2326L for each block 2304-2326 in the 
bikmap file 2344 are updated in Figure 17J. Blocks that have not changed since 
11 the consistency point began in Figure 17B have the same values in their entries. 
The entries are updated by copying BITO (FS-bit) to the consistency point bit 



I ■ 
12 



,13 (BIT31). Block 2306 is not part of the active file system, therefore BITO is equal 

14 to zero (BITO was tumed off in step 620 when block 2308 was allocated to hold 

15 the new data for that part of the inode file). This is illustrated in Figure 17J for 

16 entry 2326B. Similarly, entry 2326F for block 2314 of file 2340 has BITO and 
BIT31 equal to zero. Block 2320 of file 2342 and block 2324 of bikmap file 2344 
are handled similariy as shown in entries 2361 and 2326K, respectively. In step 

19 650, dirty block 2308 of inode file 2346 and dirty block 2326 of bikmap file 2344 
are written to disk. This is indicated In Figure 17K by a triple asterisk (***) beside 



17 



18 



20 



21 blocks 2308 and 2326. 



22 
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1 Referring to Figure 6, in step 540, the file system infomiation blocl^ 2302 Is 

2 flushed to disk, this is performed twice. Thus, fsinfo block 2302 is dirtied and 

3 then written to disk (indicated by a triple asterisk) in Figure 17L. In Figure 17L, a 

4 single fsinfo block 2302 is illustrated. As shown in the diagram, fsinfo block 2302 

5 now references block 2304 and 2308 of the inode file 2346. In Figure 17L, block 

6 2306 is no longer part of the inode file 2346 in the active file system. Similarly, 

7 file 2340 referenced by inode 2308A of inode file 2346 comprises blocks 2310 

8 and 2312. Block 2314 is no longer part of file 2340 in this consistency point. File 
I 9 2342 comprises blocks 2316, 2318, and 2322 In the new consistency point 

whereas block 2320 Is not part of file 2342. Further, block 2308 of Inode file 2346 



0 10 



12 



ttix 

- 

'rsssr 1^ 



17 



19 



20 



21 



22 



23 



references a new bikmap file 2344 comprising block 2326. 



U 13 As shown in Figure 17L, in a consistency point, the active file system is up- 

3 14 dated by copying the mode of the inode file 2346 into fsinfo block 2302. How- 
ever, the blocks 2314, 2320, 2324, and 2306 of the previous consistency point 



16 remain on disk. These blocks are never overwritten when updating the file sys- 



tem to ensure that both the old consistency point 1 830 and the new consistency 



18 point 1831 exist on disk in Figure 20 during step 540. 
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SNAPSHOTS 



3 The WAFL system supports snapshots. A snapshot is a read-only copy of 

4 an entire file system at a given Instant when the snapshot is created. A newly 

5 created snapshot refers to exactly the same disk blocks as the active file system 

6 does. Therefore, it is created in a small period of time and does not consume any 

7 additional disk space. Only as data blocks in the active file system are modified 



and written to new locations on disk does the snapshot begin to consume extra 



9 space. 



WAFL supports up to 20 different snapshots that are numbered 1 through 



W 12 20. Thus, WAFL allows the creation of multiple "clones" of the same file system. 

m 

? "rip 

13 Each snapshot is represented by a snapshot inode that is similar to the repre- 
sentation of the active file system by a root inode. Snapshots are created by du- 

15 plicating the root data structure of the file system. In the preferred embodiment. 

16 the root data structure is the root inode. However, any data structure representa- 



17 tive of an entire file system could be used. The snapshot inodes reside in a fixed 

18 location in the inode file. The limit of 20 snapshots is imposed by the size of the 

19 bikmap entries. WAFL requires two steps to create a new snapshot N: copy the 

20 root inode into the inode for snapshot N; and, copy bit 0 into bit N of each bikmap 

21 entry in the bikmap file. Bit 0 indicates the blocks that are referenced by the tree 

22 beneath the root inode. 

23 
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The result is a new file system tree rooted by snapshot inode N that refer- 
ences exactly the same disk blocks as the root inode. Setting a con-esponding bit 
in the bikmap for each block in the snapshot prevents snapshot blocks from being 
freed even if the active file no longer uses the snapshot blocks. Because WAFL 
always writes new data to unused disk locations, the snapshot tree does not 
change even though the active file system changes. Because a newly created 
snapshot tree references exactly the same blocks as the root inode, it consumes 
no additional disk space. Over time, the snapshot references disk blocks that 
would otherwise have been freed. Thus, over time the snapshot and the active 
fife system share fewer and fewer blocks, and the space consumed by the snap- 
shot increases. Snapshots can be deleted when they consume unacceptable 
numbers of disk blocks. 

The list of active snapshots along with the names of the snapshots is 
stored in a meta-data file called the snapshot directory. The disk state is updated 
as described above. As with all other changes, the update occurs by automati- 
cally advancing from one consistency point to another. Modified blocks are writ- 
ten to unused locations on the disk after which a new root inode describing the 
updated file system is written. 
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1 Overview of Snapshots 



it—, ""i 



3 Figure 18A is a diagram of the file system 1830, before a snapshot is 

4 taken, where levels of indirection have been removed to provide a simpler over- 

5 view of the WAFL file system. The file system 1830 represents the file system 

6 1690 of Figure 16. The file system 1830 is comprised of blocks 1812-1820. The 

7 inode of the inode file is contained in fsinfo block 1810. While a single copy of 

8 the fsinfo block 1810 is shown in Figure 18A, it should be understood that a sec- 

9 ond copy of fsinfo block exists on disk. The inode 181 OA contained in the fsinfo 



3 10 block 1810 comprises 16 pointers that point to 16 blocks having the same level of 
11 indirection. The blocks 1812-1820 in Figure 18A represent all blocks in the file 



r; 12 



3 



s fir 



17 



system 1 830 including direct blocks, indirect blocks, etc. Though only five blocks 



13 181 2-1 820 are shown, each block may point to other blocks. 



O 15 Figure 18B is a diagram illustrating the creation of a snapshot. The snap- 

16 shot is made for the entire file system 1 830 by simply copying the inode 1 81 OA of 



the inode file that is stored in fsinfo block 1810 into the snapshot inode 1822. By 



18 copying the inode 181 OA of the inode file, a new file of inodes is created repre- 

19 senting the same file system as the active file system. Because the inode 181 OA 

20 of the inode file itself is copied, no other blocks 1 81 2-1 820 need to be duplicated. 

21 The copied inode or snapshot inode 1822, is then copied into the inode file, which 

22 dirties a block in the inode file. For an inode file comprised of one or more levels 

23 of indirection, each indirect block is in turn dirtied. This process of dirtying blocks 
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1 propagates through all the levels of indirection. Each 4 KB block in the inode file 

2 on disk contains 32 inodes where each inode Is 128 bytes long. 



4 The new snapshot Inode 1822 of Figure 18B points back to the highest 

5 level of indirection blocks 1812-1820 referenced by the inode 1810A of the inode 

6 file when the snapshot 1822 was taken. The Inode file itself Is a recursive struc- 

7 ture because It contains snapshots of the file system 1830. Each snapshot 1822 

8 is a copy of the inode 1 81 OA of the inode file that is copied into the inode file. 



I 10 Figure 18C is a diagram illustrating the active file system 1830 and a snap- 

11 shot 1822 when a change to the active file system 1830 subsequently occurs af- 



f:i; 



jl 12 ter the snapshot 1822 Is taken. As illustrated In the diagram, block 1818 com- 
13 prising data "D" is modified after the snapshot was taken (In Figure 188), and 



s ■ 



P! 14 therefore a new block 1824 containing data "Dprime" is allocated for the active 



■» »\ 



15 file system 1830. Thus, the active file system 1830 comprises blocks 1812-1816 

16 and 1820-1824 but does not contain block 1818 containing data "D". However, 

17 block 1 81 8 containing data "D" is not ovenwrltten because the WAFL system does 

18 not ovenA/rite blocks on disk. The block 1818 is protected against being over- 

19 written by a snapshot bit that is set in the bikmap entry for block 1818. Therefore, 

20 the snapshot 1822 still points to the unmodified block 1818 as well as blocks 

21 1812-1816 and 1820. The present invention, as Illustrated in Figures 18A-18C, is 

22 unlike prior art systems that create "clones" of a file system where a clone is a 

23 copy of all the blocks of an Inode file on disk. Thus, the entire contents of the 
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1 prior art inode files are duplicated requiring large amounts (MB) of disk space as 

2 well as requiring substantial time for disk I/O operations. 



4 As the active file system 1830 is modified in Figure 18C, It uses more disk 

5 space because the file system comprising blocks 1812-1820 is not overwritten. 

6 In Figure 18C, block 1818 is illustrated as a direct block. However, in an actual 

7 file system, block 1818 may be pointed to by indirect block as well. Thus, when 

8 block 1818 is modified and stored in a new disk location as block 1824, the con-e- 

9 spending direct and indirect blocks are also copied and assigned to the active file 



p, 10 system 1830. 



i 12 Figure 19 is a diagram illustrating the changes occuning in block 1824 of 

13 Figure 18C. Block 1824 of Figure 18C is represented within dotted line 1824 in 



i K 



5 'J' 
r. "tip 



P? 14 Figure 19. Figure 19 illustrates several levels of indirection for block 1824 of Fig- 

ci 15 ure 18C. The new block 1910 that is written to disk in Figure 18C is labeled 1910 

16 in Figure 19. Because block 1824 comprises a data block 1910 containing modi- 

17 fied data that is referenced by double Indirection, two other blocks 1918 and 1926 

18 are also modified. The pointer 1924 of single-indirect block 1918 references new 

19 block 1910, therefore block 1918 must also be written to disk in a new location. 

20 Similariy, pointer 1928 of indirect block 1926 is modified because it points to 

21 block 1918. Therefore, as shown in Figure 19, modifying a data block 1910 can 

22 cause several indirect blocks 1918 and 1926 to be modified as well. This re- 

23 quires blocks 1918 and 1926 to be written to disk in a new location as well. 
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Because the direct and indirect blocks 1910. 1918 and 1926 of data block 
1824 of Figure 18C have changed and been written to a new location, the inode 
in the inode file is written lo a new block. The modified block of the inode file is 
allocated a new block on disk since data cannot be overwritten. 



6 As shown In Figure 1 9, block 1 91 0 Is pointed to by Indirect blocks 1 926 and 

7 1918, respectively. Thus when block 1910 is modified and stored in a new disk 

8 location, the corresponding direct and indirect blocks are also copied and as- 
signed to the active file system. Thus, a number of data structures must be up- 



10 dated. Changing direct block 1910 and indirection blocks 1918 and 1926 causes 



7^ 

the bikmap file to be modified. 



13 The key data structures for snapshots are the bikmap entries where each 

14 entry has multiple bits for a snapshot. This enables a plurality of snapshots to be 

15 created. A snapshot is a picture of a tree of blocks that is the file system (1830 of 

16 Figure 18). As long as new data Is not written onto blocks of the snapshot, the 

17 file system represented by the snapshot is not changed. A snapshot Is similar to 

18 a consistency point. 

19 

20 The file system of the present invention is completely consistent as of the 

21 last time the fsinfo blocks 1810 and 1870 were written. Therefore, if power Is in- 

22 temipted to the system, upon restart the file system 1830 comes up In a consis- 

23 tent state. Because 8-32 MB of disk space are used in typical prior art "clone" of 
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a 1 GB file system, clones are not conducive to consistency points or snapshots 



2 as is the present invention. 



4 Refemng to Figure 22, two previous snapshots 21 10A and 21 10B exist on 

5 disk. At the instant when a third snapshot is created, the root inode pointing to 

6 the active file system is copied into the inode entry 21 IOC for the third snapshot 

7 in the inode file 2110. At the same time in the consistency point that goes 

8 through, a flag indicates that snapshot 3 has been created. The entire file system 

9 is processed by checking if BITO for each entry in the bikmap file is set (1) or 
cleared (0}. All the BITO values for each bikmap entry are copied into the plane 

11 for snapshot three. When completed, every active block 2110-2116 and 1207 in 

12 the file system is in the snapshot at the instant it is taken. 



14 Blocks that have existed on disk continuously for a given length of time are 



ci 15 also present in con-esponding snapshots 21 10-211 OB preceding the third snap- 

■rt t\ 

1* T\ 

16 shot 21 IOC. If a block has been in the file system for a long enough period of 

17 time, it is present in all the snapshots. Block 1207 is such a block. As shown in 

18 Figure 22, block 1207 is referenced by inode 2210G of the active inode file, and 

19 indirectly by snapshots 1 , 2 and 3. 

20 

21 The sequential order of snapshots does not necessarily represent a 

22 chronological sequence of file system copies. Each individual snapshot in a file 

23 system can be deleted at any given time, thereby making an entry available for 



Express Mailing EL 734 815 746 US 



51 



103.1002.12 



1 subsequent use. When BITO of a bikmap entry that references the active file 

2 system is cleared (indicating the block has been deleted from the active file sys- 

3 tem), the block cannot be reused if any of the snapshot reference bits are set. 

4 This is because the block is part of a snapshot that is still in use. A block can 

5 only be reused when all the bits in the bikmap entry are set to zero. 



8 
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Algorithm for Generatino a Snapshot 



9 Creating a snapshot is almost exactly like creating a regular consistency 

10 point as shown in Figure 5. In step 510, all dirty inodes are marked as being in 

11 the consistency point. In step 520, all regular files are flushed to disk. In step 



i 12 530, special files (i.e., the inode file and the bikmap file) are flushed to disk. In 

13 step 540, the fsinfo blocks are flushed to disk. In step 550, all inodes that were 

14 not in the consistency point are processed. Figure 5 is described above in detail. 



0 15 In fact, creating a snapshot is done as part of creating a consistency point. The 

* >» K 

16 primary difference between creating a snapshot and a consistency point is that all 

17 entries of the bikmap file have the active FS-bit copied into the snapshot bit. The 

18 snapshot bit represents the corresponding snapshot in order to protect the blocks 

19 in the snapshot from being ovenA/ritten. The creation and deletion of snapshot is 

20 performed in step 530 because that is the only point where the file system is 

21 completely self-consistent and about to go to disk. 



22 
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Different steps are performed in step 530 then illustrated in Figure 6 for a 



2 consistency point when a new snapshot is created. The steps are very similar to 

3 those for a regular consistency point. Figure 7 is a flow diagram illustrating the 

4 steps that step 530 comprises for creating a snapshot. As described above, step 

5 530 allocates disk space for the bikmap file and the inode file and copies the ac- 

6 five FS-bit into the snapshot bit that represents the corresponding snapshot In or- 

7 der to protect the blocks In the snapshot from being ovenwritten. 



In step 710, the inodes of the bikmap file and the snapshot being created 



i 10 are pre-flushed to disk. In addition to flushing the inode of the bikmap file to a 

11 block of the Inode file (as in step 610 of Figure 6 for a consistency point), the 

1 12 inode of the snapshot being created is also flushed to a block of the inode file. 

■fir . 

L 13 This ensures that the block of the inode file containing the Inode of the snapshot 



14 is dirty. 



16 In step 720, every block in the bikmap file is dirtied. In step 760 (described 

17 below), all entries in the bikmap file are updated instead of just the entries in dirty 

18 blocks. Thus, all blocks of the bikmap file must be marked dirty here to ensure 

19 that step 730 write-allocates disk space for them. 

20 

21 In step 730, disk space is allocated for all dirty blocks in the inode and 

22 bikmap files. The dirty blocks include the block in the inode file containing the 
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1 inode of the bikmap file, which is dirty, and the block containing the inode for the 

2 new snapshot. 



f's 10 

'I SIS 



% 11 



4 In step 740, the contents of the root inode for the file system are copied 

5 into the inode of the snapshot in the inode file. At this time, every block that is 

6 part of the new consistency point and that will be written to disk has disk space 

7 allocated for it. Thus, duplicating the root Inode in the snapshot inode effectively 

8 copies the entire active file system. The actual blocks that will be in the snapshot 

9 are the same blocks of the active file system. 



|l 12 into the inode file. 

13 

5^ ii 



In step 750, the inodes of the bikmap file and the snapshot are copied to 



n 14 In step 760, entries in the bikmap file are updated. In addition to copying 



15 the active FS-bit to the CP-bit for the entries, the active FS-bit is also copied to 

16 the snapshot bit corresponding to the new snapshot. 

17 

18 In step 770, all dirty blocks in the bikmap and inode files are written to disk. 

19 

20 Finally, at some time, snapshots themselves are removed from the file 

21 system in step 760. A snapshot is removed from the file system by clearing its 

22 snapshot inode entry in the inode file of the active file system and clearing each 

23 bit corresponding to the snapshot number in every entry in the bikmap file. A 
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count is performed also of each bit for the snapshot in all the bikmap entries that 



2 are cleared from a set value, thereby providing a count of the blocks that are 

3 fiaed (con-esponding amount of disk space that is freed) by deleting the snap- 

4 shot. The system decides which snapshot to delete on the basis of the oldest 

5 snapshots. Users can also choose to delete specified snapshots manually. 



7 The present invention limits the total number of snapshots and keeps a 

8 bikmap file that has entries with multiple bits for tracking the snapshots instead of 

9 using pointers having a COW bit as in Episode. An unused block has ail zeroes 

10 for the bits in its bikmap file entry. Over time, the BITO for the active file system 

11 is usually turned on at some instant. Setting BITO identifies the corresponding 

12 block as allocated in the active file system. As indicated above, all snapshot bits 

13 are initially set to zero. If the active file bit is cleared before any snapshot bits are 

14 set, the block is not present in any snapshot stored on disk. Therefore, the block 

15 is immediately available for reallocation and cannot be recovered subsequently 

16 from a snapshot. 

17 

18 Generation of a Snapshot 

19 

20 As described previously, a snapshot is very similar to a consistency point. 

21 Therefore, generation of a snapshot is described with reference to the differences 

22 between it and the generation of a consistency point shown in Figures 17A-17L. 

23 Figures 21 A-21 F illustrates the differences for generating a snapshot. 
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Figures 17A-17D Illustrate the state of the WAFL file system when a snap- 
shot Is begun. All dirty inodes are marked as being In the consistency point In 
step 510 and regular files are flushed to disk In step 520. Thus, Initial processing 
of a snapshot Is identical to that for a consistency point. Processing for a snap- 
shot differs in step 530 fi-om that for a consistency point. The following describes 
processing of a snapshot according to Figure 7. 

The following description Is for a second snapshot of the WAFL file system. 
A first snapshot Is recorded In the bikmap entries of Figure 17C. As Indicated In 
entries 2324A-2324M, blocks 2304-2306, 2310-2320, and 2324 are contained In 
the first snapshot. All other snapshot bits (BIT1-BIT20) are assumed to have val- 
ues of 0 indicating that a conresponding snapshot does not exist on disk. Figure 
21 A illustrates the file system after steps 510 and 520 are completed. 

In step 710, Inodes 2308C and 2308D of snapshot 2 and bikmap file 2344 
are pre-flushed to disk. This ensures that the block of the inode file that is going 
to contain the snapshot 2 inode Is dirty. In Figure 21 B, inodes 2308C and 2308D 
are pre-flushed for snapshot 2 and for bikmap file 2344. 

In step 720, the entire bikmap file 2344 Is dirtied. This will cause the entire 
bikmap file 2344 to be allocated disk space in step 730. In step 730, disk space 
is allocated for dirty blocks 2308 and 2326 for inode file 2346 and bikmap file 
2344 as shown in Figure 21 C. This Is indicated by a double asterisk (**) beside 
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1 blocks 2308 and 2326. This is different from generating a consistency point 

2 wfiere disk space is allocated only for blocks having entries that have changed in 

3 the bikmap file 2344 in step 620 of Figure 6. BIkmap file 2^44 of Figure 21 C 

4 comprises a single block 2324. However, when bikmap file 2344 comprises more 

5 than one block, disk space is allocated for ail the blocks in step 730. 



StS- 



7 In step 740, the root inode for the new file system is copied into inode 

8 2308D for snapshot 2. In step 750, the inodes 2308C and 2308D of bikmap file 

9 2344 and snapshot 2 are flushed to the inode file as Illustrated in Figure 21 D. 



■J 



i 10 The diagram illustrates that snapshot 2 inode 2308D references blocks 2304 and 



J 11 2308 but not block 2306. 



Tf W. J\ 



#12 



f *^ 



13 In step 760, entries 2326A-2326L in block 2326 of the bikmap file 2344 are 



ftj 14 updated as illustrated in Figure 21 E. The diagram illustrates that the snapshot 2 

15 bit (BIT2) is updated as well as the FS-BIT and CP-BIT for each entry 2326A- 

16 2326L. Thus, blocks 2304, 2308-2312, 2316-2318, 2322, and 2326 are con- 

17 tained in snapshot 2 whereas blocks 2306, 2314, 2320, and 2324 are not. In step 

18 770, the dirty blocks 2308 and 2326 are written to disk. 

19 

20 Further processing of snapshot 2 Is identical to that for generation of a 

21 consistency point illustrated in Figure 5. In step 540, the two fsinfo blocks are 

22 flushed to disk. Thus, Figure 21 F represents the WAFL file system in a consis- 

23 tent state after this step. Files 2340, 2342, 2344, and 2346 of the consistent file 
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1 system, after step 540 Is completed, are indicated within dotted lines in Figure 

2 21 F. In step 550, the consistency point is completed by processing inodes that 

3 were not In the consistency point. 



15 



^ F\ 



Access Time Overwrites 



7 Unix file systems must maintain an "access time" (atlme) in each Inode. 

8 Atime Indicates the last time that the file was read. It is updated every time the 

9 file Is accessed. Consequently, when a file Is read the block that contains the 

10 inode In the Inode file Is rewritten to update the Inode. This could be disadvanta- 



S 11 geous for creating snapshots because; as a consequence, reading a file could 



1 12 potentially use up disk space. Further, reading all the files In the file system could 

V 

13 cause the entire inode file to be duplicated. The present Invention solves this 



14 problem. 



16 Because of atime, a read could potentially consume disk space since 

17 modifying an Inode causes a new block for the inode file to written on disk. 

18 Further, a read operation could potentially fail If a file system Is fijil which is an 

19 abnormal condition for a file system to have occur. 

20 

21 In general, data on disk is not overwritten in the WAFL file system so as to 

22 protect data stored on disk. The only exception to this rule is atime ovenwrites for 

23 an inode as illustrated in Figures 23A-23B: When an "atime overwrites" occurs, 
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1 the only data that is modified in a blocl< of the inode file is the atime of one or 

2 more of the inodes it contains and the block is rewritten in the same location. 

3 This is the only exception in the WAFL system; othenwise new data is always 

4 written to new disk locations. 



I' i' 

? 



6 In Figure 23A, the atimes 2423 and 2433 of an Inode 2422 in an old WAFL 

7 inode file block 2420 and the snapshot inode 2432 that references block 2420 are 

8 illustrated. Inode 2422 of block 2420 references direct block 2410. The atime 

9 2423 of inode 2422 Is "4/30 9:15 PM" whereas the atime 2433 of snapshot inode 

10 2432 is "5/1 10:00 AM". Figure 23A illustrates ttie file system before direct buffer 

11 2410 Is accessed. 



U '1 2 



13 Figure 23B illustrates the inode 2422 of direct block 241 0 after direct block 

It is 

14 2410 has been accessed. As shown in the diagram, the access time 2423 of 

itt *• 
J*f l\ ■ 

CI 15 inode 2422 is overwritten with the access time 2433 of snapshot 2432 that refer- 

16 ences it. Thus, tiie access time 2423 of inode 2422 for direct block 2410 is "5/1 

17 11:23 AM". 

18 

19 Allowing inode file blocks to be overwritten with new atimes produces a 

20 slight inconsistency in the snapshot. The atime of a file in a snapshot can actu- 

21 ally be later than the time that the snapshot was created. In order to prevent us- 

22 ers from detecting this inconsistency, WAFL adjusts the atime of all files in a 

23 snapshot to the time when the snapshot was actually created instead of the time 
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a file was last accessed. This snapshot time is stored in the inode that describes 
the snapshot as a whole. Thus, when accessed via the snapshot, the access 
time 2423 for inode 2422 is always reported as "5/1 10:00AM". This occurs both 
before the update when it may be expected to be "4/30 9:15PM", and after the 
update when it may be expected to be "5/1 1 1 :23AM". When accessed through 
the active file system, the times are reported as "4/30 9:15PM" and "5/1 
11:23AM" before and after the update, respectively. 

In this manner, a method is disclosed for maintaining a file system in a 
consistent state and for creating read-only copies of the file system. 
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